Comparing Two Approaches for Adding Feature Ranking to Sampled Ensemble Learning for Software Quality Estimation
Authors
Abstract
High dimensionality and class imbalance are two main problems that affect the quality of training datasets in software defect prediction, resulting in inefficient classification models. Feature selection and data sampling are often used to overcome these problems. Feature selection is the process of choosing the most important attributes from the original dataset. Data sampling alters the dataset to change its balance level. Another technique, called boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), has also been found effective for resolving the class imbalance problem. In particular, RUSBoost, which integrates random undersampling with AdaBoost, has been shown to improve classification performance for imbalanced training datasets. In this study, we investigated an approach for combining feature selection with this ensemble learning (boosting) process. We focused on two different scenarios: feature selection performed prior to the boosting process and feature selection performed inside the boosting process. Ten base feature ranking techniques and an ensemble ranker based on the ten were examined and compared over the two scenarios. The experimental results demonstrate that the ensemble feature ranking method generally performed better than or similarly to the average of the base ranking techniques, and more importantly, the ensemble method exhibited better robustness than any of the base ranking techniques. As for the two scenarios, the results show that applying feature selection inside boosting performed better than using feature selection prior to boosting.
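The difference between the two scenarios can be made concrete with a small sketch. The code below is not the authors' implementation: it assumes a simplified two-class AdaBoost-style loop with random undersampling in each round, decision stumps as base learners, and a single hypothetical ranker (`mean_diff_scores`, the absolute difference of class means) standing in for the ten rankers and the ensemble ranker of the study. All function names and parameters are illustrative assumptions. The `inside` flag switches between ranking features once before boosting and re-ranking them on each undersampled round.

```python
import math
import random

def mean_diff_scores(X, y):
    """Hypothetical ranker: |mean(class 1) - mean(class 0)| for each feature."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return scores

def top_k(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = mean_diff_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def undersample(y, rng):
    """Balanced row indices: every minority row plus as many random majority rows."""
    pos = [i for i, lab in enumerate(y) if lab == 1]
    neg = [i for i, lab in enumerate(y) if lab == 0]
    small, large = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return small + rng.sample(large, len(small))

def fit_stump(X, y, w, feats):
    """Weighted decision stump (feature, threshold, sign) restricted to `feats`."""
    best = None
    for j in feats:
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                err = sum(wi for wi, p, lab in zip(w, pred, y) if p != lab)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best[1:]

def stump_predict(stump, row):
    j, t, sign = stump
    return 1 if sign * (row[j] - t) > 0 else 0

def rusboost(X, y, k, rounds, inside, seed=0):
    """Simplified RUSBoost-style loop; `inside` selects the scenario."""
    rng = random.Random(seed)
    w = [1.0 / len(X)] * len(X)
    fixed = None if inside else top_k(X, y, k)  # scenario 1: rank once, up front
    model = []
    for _ in range(rounds):
        idx = undersample(y, rng)
        Xs, ys, ws = [X[i] for i in idx], [y[i] for i in idx], [w[i] for i in idx]
        feats = top_k(Xs, ys, k) if inside else fixed  # scenario 2: rank per round
        stump = fit_stump(Xs, ys, ws, feats)
        pred = [stump_predict(stump, row) for row in X]
        err = sum(wi for wi, p, lab in zip(w, pred, y) if p != lab)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        # Raise the weight of misclassified rows, lower the rest, renormalize.
        w = [wi * math.exp(alpha if p != lab else -alpha)
             for wi, p, lab in zip(w, pred, y)]
        total = sum(w)
        w = [wi / total for wi in w]
        model.append((alpha, stump))
    return model

def predict(model, row):
    """Weighted vote of the stumps; class 1 if the vote is positive."""
    vote = sum(a * (1 if stump_predict(s, row) == 1 else -1) for a, s in model)
    return 1 if vote > 0 else 0
```

Calling `rusboost(X, y, k, rounds, inside=False)` corresponds to feature selection prior to boosting, and `inside=True` to feature selection inside boosting, where each round may pick a different feature subset because the ranking is computed on that round's undersampled data.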
Similar resources
Fault Detection of Anti-friction Bearing using Ensemble Machine Learning Methods
The anti-friction bearing (AFB) is a very important machine component, and its unscheduled failure causes malfunctions in a wide range of rotating machinery, resulting in unexpected downtime and economic loss. In this paper, ensemble machine learning techniques are demonstrated for the detection of different AFB faults. Initially, statistical features were extracted from temporal vibratio...
Machine learning algorithms in air quality modeling
Modern studies in the field of environmental science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. Recent advances in statistical modeling based on machine learning approaches have emerged as a solution to these issues. It is a fact that input variable type largely affec...
Combining Feature Selection and Ensemble Learning for Software Quality Estimation
High dimensionality is a major problem that affects the quality of training datasets and therefore classification models. Feature selection is frequently used to deal with this problem. The goal of feature selection is to choose the most relevant and important attributes from the raw dataset. Another major challenge to building effective classification models from binary datasets is class imbal...
Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques
Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE has been practiced in the computer industry since the 1940s and has been reviewed several times. An SDEE model is appropriate if it provides the accuracy and confidence simultaneously before softwa...
Presenting an ensemble-learning-based algorithm for learning to rank in information retrieval
Learning to rank refers to machine learning techniques for training a model in a ranking task. Learning to rank has been shown to be useful in many applications of information retrieval, natural language processing, and data mining. Learning to rank can be described by two systems: a learning system and a ranking system. The learning system takes training data as input and constructs a ranking ...